In this work, I consider a dataset on cancer mortality in U.S. counties:
library(tidyverse)   # dplyr, ggplot2, tidyr
library(skimr)       # skim()
library(VIM)         # aggr()
library(mice)        # multiple imputation
library(ggcorrplot)  # correlation heatmap
library(leaps)       # regsubsets()
library(caret)       # cross-validation helpers
library(glmnet)      # lasso

cancer.dataset <- read.csv("cancer_reg.csv") %>%
  mutate(Geography = as.factor(Geography),
         binnedInc = as.factor(binnedInc))
cancer.dataset %>% skim()
| Name | Piped data |
| Number of rows | 3047 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| binnedInc | 0 | 1 | FALSE | 10 | (45: 306, (54: 306, [22: 306, (42: 305 |
| Geography | 0 | 1 | FALSE | 3047 | Abb: 1, Aca: 1, Acc: 1, Ada: 1 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| avgAnnCount | 0 | 1.00 | 606.34 | 1416.36 | 6.00 | 76.00 | 171.00 | 518.00 | 38150.00 | ▇▁▁▁▁ |
| avgDeathsPerYear | 0 | 1.00 | 185.97 | 504.13 | 3.00 | 28.00 | 61.00 | 149.00 | 14010.00 | ▇▁▁▁▁ |
| TARGET_deathRate | 0 | 1.00 | 178.66 | 27.75 | 59.70 | 161.20 | 178.10 | 195.20 | 362.80 | ▁▇▆▁▁ |
| incidenceRate | 0 | 1.00 | 448.27 | 54.56 | 201.30 | 420.30 | 453.55 | 480.85 | 1206.90 | ▂▇▁▁▁ |
| medIncome | 0 | 1.00 | 47063.28 | 12040.09 | 22640.00 | 38882.50 | 45207.00 | 52492.00 | 125635.00 | ▇▇▁▁▁ |
| popEst2015 | 0 | 1.00 | 102637.37 | 329059.22 | 827.00 | 11684.00 | 26643.00 | 68671.00 | 10170292.00 | ▇▁▁▁▁ |
| povertyPercent | 0 | 1.00 | 16.88 | 6.41 | 3.20 | 12.15 | 15.90 | 20.40 | 47.40 | ▃▇▃▁▁ |
| studyPerCap | 0 | 1.00 | 155.40 | 529.63 | 0.00 | 0.00 | 0.00 | 83.65 | 9762.31 | ▇▁▁▁▁ |
| MedianAge | 0 | 1.00 | 45.27 | 45.30 | 22.30 | 37.70 | 41.00 | 44.00 | 624.00 | ▇▁▁▁▁ |
| MedianAgeMale | 0 | 1.00 | 39.57 | 5.23 | 22.40 | 36.35 | 39.60 | 42.50 | 64.70 | ▁▇▇▁▁ |
| MedianAgeFemale | 0 | 1.00 | 42.15 | 5.29 | 22.30 | 39.10 | 42.40 | 45.30 | 65.70 | ▁▃▇▂▁ |
| AvgHouseholdSize | 0 | 1.00 | 2.48 | 0.43 | 0.02 | 2.37 | 2.50 | 2.63 | 3.97 | ▁▁▃▇▁ |
| PercentMarried | 0 | 1.00 | 51.77 | 6.90 | 23.10 | 47.75 | 52.40 | 56.40 | 72.50 | ▁▂▇▇▁ |
| PctNoHS18_24 | 0 | 1.00 | 18.22 | 8.09 | 0.00 | 12.80 | 17.10 | 22.70 | 64.10 | ▃▇▂▁▁ |
| PctHS18_24 | 0 | 1.00 | 35.00 | 9.07 | 0.00 | 29.20 | 34.70 | 40.70 | 72.50 | ▁▃▇▂▁ |
| PctSomeCol18_24 | 2285 | 0.25 | 40.98 | 11.12 | 7.10 | 34.00 | 40.40 | 46.40 | 79.00 | ▁▅▇▂▁ |
| PctBachDeg18_24 | 0 | 1.00 | 6.16 | 4.53 | 0.00 | 3.10 | 5.40 | 8.20 | 51.80 | ▇▁▁▁▁ |
| PctHS25_Over | 0 | 1.00 | 34.80 | 7.03 | 7.50 | 30.40 | 35.30 | 39.65 | 54.80 | ▁▂▇▇▁ |
| PctBachDeg25_Over | 0 | 1.00 | 13.28 | 5.39 | 2.50 | 9.40 | 12.30 | 16.10 | 42.20 | ▆▇▂▁▁ |
| PctEmployed16_Over | 152 | 0.95 | 54.15 | 8.32 | 17.60 | 48.60 | 54.50 | 60.30 | 80.10 | ▁▂▇▇▁ |
| PctUnemployed16_Over | 0 | 1.00 | 7.85 | 3.45 | 0.40 | 5.50 | 7.60 | 9.70 | 29.40 | ▅▇▁▁▁ |
| PctPrivateCoverage | 0 | 1.00 | 64.35 | 10.65 | 22.30 | 57.20 | 65.10 | 72.10 | 92.30 | ▁▂▇▇▂ |
| PctPrivateCoverageAlone | 609 | 0.80 | 48.45 | 10.08 | 15.70 | 41.00 | 48.70 | 55.60 | 78.90 | ▁▅▇▅▁ |
| PctEmpPrivCoverage | 0 | 1.00 | 41.20 | 9.45 | 13.50 | 34.50 | 41.10 | 47.70 | 70.70 | ▁▅▇▃▁ |
| PctPublicCoverage | 0 | 1.00 | 36.25 | 7.84 | 11.20 | 30.90 | 36.30 | 41.55 | 65.10 | ▁▅▇▂▁ |
| PctPublicCoverageAlone | 0 | 1.00 | 19.24 | 6.11 | 2.60 | 14.85 | 18.80 | 23.10 | 46.60 | ▂▇▆▁▁ |
| PctWhite | 0 | 1.00 | 83.65 | 16.38 | 10.20 | 77.30 | 90.06 | 95.45 | 100.00 | ▁▁▁▂▇ |
| PctBlack | 0 | 1.00 | 9.11 | 14.53 | 0.00 | 0.62 | 2.25 | 10.51 | 85.95 | ▇▁▁▁▁ |
| PctAsian | 0 | 1.00 | 1.25 | 2.61 | 0.00 | 0.25 | 0.55 | 1.22 | 42.62 | ▇▁▁▁▁ |
| PctOtherRace | 0 | 1.00 | 1.98 | 3.52 | 0.00 | 0.30 | 0.83 | 2.18 | 41.93 | ▇▁▁▁▁ |
| PctMarriedHouseholds | 0 | 1.00 | 51.24 | 6.57 | 22.99 | 47.76 | 51.67 | 55.40 | 78.08 | ▁▂▇▂▁ |
| BirthRate | 0 | 1.00 | 5.64 | 1.99 | 0.00 | 4.52 | 5.38 | 6.49 | 21.33 | ▂▇▁▁▁ |
These descriptions were taken directly from the dataset's website; for example:
BirthRate: Number of live births relative to the number of women in the county
Geography identifies the county of each observation, so it acts as an identifier. The remaining variables fall mainly into two categories: economic and demographic. The first includes variables about employment, education, household size, health insurance and so on; the second covers the age of the population, the number of cancer diagnoses, ethnicity (we will check for dependencies between income and ethnicity), marriage rates and the like.
We can draw some scatterplot matrices to look for dependencies between these variables:
pairs(cancer.dataset %>% select(
medIncome,
povertyPercent,
AvgHouseholdSize,
PctPrivateCoverage,
PctPrivateCoverageAlone,
PctEmpPrivCoverage,
PctPublicCoverage,
PctPublicCoverageAlone
))
* Education and employment features:
pairs(cancer.dataset %>% select(
PctHS18_24,
PctSomeCol18_24,
binnedInc,
PctBachDeg18_24,
PctHS25_Over,
PctBachDeg25_Over,
PctEmployed16_Over,
PctUnemployed16_Over
))
pairs(cancer.dataset %>% select(
PctWhite,
PctBlack,
PctAsian,
PctOtherRace,
medIncome
))
This plot shows a relationship between medIncome and the percentage of the population belonging to each ethnic group.
pairs(cancer.dataset %>% select(avgAnnCount, avgDeathsPerYear, incidenceRate, medIncome, popEst2015, povertyPercent))
Three columns have NA values in quite different proportions:
aggr(cancer.dataset)
We can impute values for them using packages such as mice:
##
## iter imp variable
## 1 1 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 1 2 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 1 3 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 1 4 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 1 5 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 2 1 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 2 2 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 2 3 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 2 4 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 2 5 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 3 1 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 3 2 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 3 3 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 3 4 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 3 5 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 4 1 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 4 2 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 4 3 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 4 4 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 4 5 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 5 1 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 5 2 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 5 3 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 5 4 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
## 5 5 PctSomeCol18_24 PctEmployed16_Over PctPrivateCoverageAlone
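The call that produced `df_filled` below is not shown; it was presumably a mice run followed by complete(). A minimal self-contained sketch of that pattern on toy data (names and settings are illustrative, not necessarily the ones actually used):

```r
library(mice)
set.seed(1)
toy <- data.frame(x = rnorm(50), y = rnorm(50))
toy$y[sample(50, 10)] <- NA                  # introduce missing values
imp <- mice(toy, m = 5, printFlag = FALSE)   # chained-equation imputation (5 sets)
toy_filled <- mice::complete(imp, 1)         # extract the first completed dataset
anyNA(toy_filled)                            # FALSE: all gaps filled
```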
# df_filled is the completed dataset produced by the mice imputation above
df <- cancer.dataset %>% mutate(
PctSomeCol18_24 = df_filled$PctSomeCol18_24,
PctEmployed16_Over = df_filled$PctEmployed16_Over,
PctPrivateCoverageAlone = df_filled$PctPrivateCoverageAlone
)
cormat <- round(cor(df %>% select(-Geography, -binnedInc)), 2)
#no NA removal is needed as we imputed the missing data
#cormat <- cormat %>% apply(1:2, function(x){ifelse(is.na(x),0,x)})
ggcorrplot(cormat, hc.order = TRUE, outline.col = "white", tl.cex = 5.75)
As linear regression was part of this course, we will instead search for a good model of the missing values manually, using this technique.
The variables to impute are PctSomeCol18_24, PctEmployed16_Over and PctPrivateCoverageAlone.
First, we remove the two factor variables and the other two columns with NAs, as we will not use them for the imputation, and keep only the complete observations. We also remove the response variable, to avoid creating artificial relationships between the imputed variables and the response.
df <- cancer.dataset %>% select(
- Geography,
- binnedInc,
- PctEmployed16_Over,
- PctPrivateCoverageAlone,
- TARGET_deathRate
) %>% filter(
!is.na(PctSomeCol18_24)
)
nrow(df)
## [1] 762
mod.full <- lm(PctSomeCol18_24 ~ ., data = df)
summary(mod.full)
##
## Call:
## lm(formula = PctSomeCol18_24 ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.116527 -0.010921 0.001025 0.012271 0.122172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+02 8.096e-02 1235.419 <2e-16 ***
## avgAnnCount -1.462e-06 4.309e-06 -0.339 0.734
## avgDeathsPerYear -1.544e-05 2.331e-05 -0.662 0.508
## incidenceRate 5.602e-05 4.212e-05 1.330 0.184
## medIncome 1.202e-07 5.160e-07 0.233 0.816
## popEst2015 3.989e-08 3.009e-08 1.326 0.185
## povertyPercent 1.032e-04 9.196e-04 0.112 0.911
## studyPerCap 2.638e-06 6.270e-06 0.421 0.674
## MedianAge 4.732e-05 4.565e-05 1.037 0.300
## MedianAgeMale 1.857e-03 1.195e-03 1.554 0.121
## MedianAgeFemale -1.852e-03 1.240e-03 -1.493 0.136
## AvgHouseholdSize 2.526e-03 5.389e-03 0.469 0.639
## PercentMarried -7.292e-04 9.074e-04 -0.804 0.422
## PctNoHS18_24 -9.993e-01 3.394e-04 -2944.706 <2e-16 ***
## PctHS18_24 -1.000e+00 2.833e-04 -3530.419 <2e-16 ***
## PctBachDeg18_24 -1.001e+00 6.815e-04 -1468.177 <2e-16 ***
## PctHS25_Over -6.381e-04 5.809e-04 -1.098 0.272
## PctBachDeg25_Over 2.105e-04 9.077e-04 0.232 0.817
## PctUnemployed16_Over -1.211e-03 9.749e-04 -1.242 0.215
## PctPrivateCoverage 2.560e-04 7.917e-04 0.323 0.746
## PctEmpPrivCoverage -5.603e-04 5.819e-04 -0.963 0.336
## PctPublicCoverage -1.482e-03 1.319e-03 -1.123 0.262
## PctPublicCoverageAlone 2.212e-03 1.717e-03 1.289 0.198
## PctWhite 3.156e-04 3.331e-04 0.948 0.344
## PctBlack -1.050e-04 3.177e-04 -0.331 0.741
## PctAsian -1.120e-03 1.206e-03 -0.928 0.353
## PctOtherRace -9.318e-04 7.828e-04 -1.190 0.234
## PctMarriedHouseholds -2.541e-04 8.537e-04 -0.298 0.766
## BirthRate 1.197e-03 1.211e-03 0.989 0.323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05631 on 733 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.059e+06 on 28 and 733 DF, p-value: < 2.2e-16
We see that only the features most directly tied to this variable are flagged as significant, and the fit is essentially perfect. Still, it can be useful to apply feature-selection strategies to reduce the complexity of the model.
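The near-perfect fit has a simple cause: the four 18–24 education shares sum to roughly 100%, so PctSomeCol18_24 is an almost exact linear combination of PctNoHS18_24, PctHS18_24 and PctBachDeg18_24 — hence the coefficients near −1 and the R² of 1. A toy reconstruction of this identity (synthetic data, illustrative only):

```r
set.seed(3)
no_hs <- runif(100, 5, 30)
hs    <- runif(100, 20, 45)
bach  <- runif(100, 0, 15)
# the shares sum to ~100, up to tiny rounding noise
some_col <- 100 - no_hs - hs - bach + rnorm(100, sd = 0.05)
fit <- lm(some_col ~ no_hs + hs + bach)
round(coef(fit))          # intercept ~100, slopes ~ -1, matching the summary above
summary(fit)$r.squared    # essentially 1
```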
We apply forward and backward feature selection:
mod.fwd <- regsubsets(PctSomeCol18_24 ~ ., data = df, method = "forward")
mod.fwd.summary <- summary(mod.fwd)
print("BIC")
## [1] "BIC"
print(mod.fwd.summary$bic)
## [1] -637.3475 -1699.9792 -8024.3332 -8025.5098 -8022.2539 -8018.0429 -8012.6549
## [8] -8007.9224
print("R²-adj")
## [1] "R²-adj"
mod.fwd.summary$adjr2
## [1] 0.5736592 0.8950675 0.9999741 0.9999743 0.9999744 0.9999745 0.9999745
## [8] 0.9999745
plot(mod.fwd.summary$adjr2, main = "Forward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.fwd.summary$adjr2)
points(adjr2.max, mod.fwd.summary$adjr2[adjr2.max],
col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.fwd.summary$bic)
points(bic.min, mod.fwd.summary$adjr2[bic.min],
col = "blue", cex = 2, pch = 20)
mod.bkw <- regsubsets(PctSomeCol18_24 ~ ., data = df, method = "backward")
mod.bkw.summary <- summary(mod.bkw)
print("BIC")
## [1] "BIC"
print(mod.bkw.summary$bic)
## [1] -637.3475 -1699.9792 -8024.3332 -8025.5098 -8022.2539 -8016.5556 -8012.0985
## [8] -8007.3559
print("R²-adj")
## [1] "R²-adj"
mod.bkw.summary$adjr2
## [1] 0.5736592 0.8950675 0.9999741 0.9999743 0.9999744 0.9999744 0.9999745
## [8] 0.9999745
plot(mod.bkw.summary$adjr2, main = "Backward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.bkw.summary$adjr2)
points(adjr2.max, mod.bkw.summary$adjr2[adjr2.max],
col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.bkw.summary$bic)
points(bic.min, mod.bkw.summary$adjr2[bic.min],
col = "blue", cex = 2, pch = 20)
# use coef(mod.bkw, n) to extract the coefficients
The four-variable model is the best in terms of BIC under both approaches, with exactly the same score.
# helper to compute predictions from a named vector of coefficients
predict.from.coefficients <- function(df, coefs){
as.matrix(df[, names(coefs)[-1]]) %*% coefs[-1] + coefs[1]
}
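A quick sanity check of the helper on toy inputs (the names x1 and x2 are illustrative; the definition is repeated so the snippet runs standalone):

```r
predict.from.coefficients <- function(df, coefs){
  as.matrix(df[, names(coefs)[-1]]) %*% coefs[-1] + coefs[1]
}
toy <- data.frame(x1 = c(1, 2), x2 = c(3, 4))
cf  <- c("(Intercept)" = 10, x1 = 2, x2 = -1)
predict.from.coefficients(toy, cf)   # 10 + 2*x1 - x2 → 9, 10
```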
We can then plot the squared error on the training data:
coefs <- coef(mod.bkw, 4)
print(coefs)
## (Intercept) popEst2015 PctNoHS18_24 PctHS18_24 PctBachDeg18_24
## 1.000031e+02 1.262458e-08 -9.996401e-01 -1.000229e+00 -1.000717e+00
preds <- predict.from.coefficients(df, coefs)
ggplot(data.frame(squared_error = (preds-df$PctSomeCol18_24)^2)) +
geom_density(aes(x = squared_error)) +
labs(title="Squared error distribution for PctSomeCol18_24 prediction")
# sd for stochastic imputation (note: the model's residual standard error
# would be the more standard choice; the marginal sd injects more noise)
std.dev.bkw.imp <- sd(df$PctSomeCol18_24)
We can now use this model to impute the NAs:
imputation <- predict.from.coefficients(cancer.dataset, coef(mod.bkw, 4)) +
  rnorm(nrow(cancer.dataset), mean = 0, sd = std.dev.bkw.imp)
cancer.dataset.filled <- cancer.dataset %>% mutate(PctSomeCol18_24 = ifelse(is.na(PctSomeCol18_24), imputation, PctSomeCol18_24))
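The fill-in pattern itself — predict for every row, add noise, replace only the NAs — in miniature (toy vectors, purely illustrative):

```r
set.seed(9)
v    <- c(1.2, NA, 3.4, NA, 5.0)     # observed values with gaps
pred <- c(1.1, 2.0, 3.5, 4.1, 5.1)   # model predictions for every row
filled <- ifelse(is.na(v), pred + rnorm(length(v), sd = 0.1), v)
filled                               # gaps filled, observed values untouched
```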
ggplot(cancer.dataset.filled) +
geom_point(aes(x=PctSomeCol18_24, y=TARGET_deathRate)) +
geom_point(data=cancer.dataset, aes(x=PctSomeCol18_24, y=TARGET_deathRate), color="red", alpha = 0.5) +
labs(title = "College education (18-24) vs Death rate, red dots are the values in data")
We can now move on to the other two variables.
df <- cancer.dataset %>% select(
- Geography,
- binnedInc,
- PctSomeCol18_24,
- PctPrivateCoverageAlone,
- TARGET_deathRate
) %>% filter(
!is.na(PctEmployed16_Over)
)
nrow(df)
## [1] 2895
mod.full <- lm(PctEmployed16_Over ~ ., data = df)
summary(mod.full)
##
## Call:
## lm(formula = PctEmployed16_Over ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0053 -1.7858 0.1298 1.8743 14.0818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.059e+01 2.268e+00 35.538 < 2e-16 ***
## avgAnnCount 6.768e-04 1.309e-04 5.169 2.51e-07 ***
## avgDeathsPerYear -2.502e-03 6.663e-04 -3.755 0.000177 ***
## incidenceRate -2.703e-03 1.289e-03 -2.097 0.036050 *
## medIncome 3.902e-07 1.379e-05 0.028 0.977425
## popEst2015 9.946e-07 9.342e-07 1.065 0.287121
## povertyPercent -5.024e-01 2.612e-02 -19.237 < 2e-16 ***
## studyPerCap 2.440e-04 1.131e-04 2.158 0.031043 *
## MedianAge 1.523e-03 1.361e-03 1.119 0.263244
## MedianAgeMale 2.569e-02 3.567e-02 0.720 0.471442
## MedianAgeFemale -2.894e-01 3.674e-02 -7.878 4.70e-15 ***
## AvgHouseholdSize 5.816e-01 1.647e-01 3.531 0.000421 ***
## PercentMarried 7.976e-01 2.516e-02 31.694 < 2e-16 ***
## PctNoHS18_24 -3.355e-02 9.534e-03 -3.519 0.000439 ***
## PctHS18_24 -5.025e-02 8.366e-03 -6.007 2.12e-09 ***
## PctBachDeg18_24 2.400e-02 1.843e-02 1.303 0.192834
## PctHS25_Over 6.649e-02 1.653e-02 4.024 5.88e-05 ***
## PctBachDeg25_Over 1.988e-01 2.661e-02 7.473 1.03e-13 ***
## PctUnemployed16_Over -4.452e-01 2.735e-02 -16.275 < 2e-16 ***
## PctPrivateCoverage 4.696e-03 2.239e-02 0.210 0.833876
## PctEmpPrivCoverage 1.188e-01 1.768e-02 6.718 2.21e-11 ***
## PctPublicCoverage -5.779e-01 3.715e-02 -15.555 < 2e-16 ***
## PctPublicCoverageAlone 5.327e-01 4.695e-02 11.345 < 2e-16 ***
## PctWhite -3.251e-02 9.910e-03 -3.280 0.001049 **
## PctBlack -1.356e-02 9.677e-03 -1.402 0.161105
## PctAsian -1.285e-01 3.184e-02 -4.036 5.57e-05 ***
## PctOtherRace 4.046e-03 2.106e-02 0.192 0.847704
## PctMarriedHouseholds -7.755e-01 2.402e-02 -32.287 < 2e-16 ***
## BirthRate 1.038e-01 3.263e-02 3.180 0.001488 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.219 on 2866 degrees of freedom
## Multiple R-squared: 0.8516, Adjusted R-squared: 0.8501
## F-statistic: 587.3 on 28 and 2866 DF, p-value: < 2.2e-16
This time many more variables are significant:
mod.fwd <- regsubsets(PctEmployed16_Over ~ ., data = df, method = "forward", nvmax=20)
mod.fwd.summary <- summary(mod.fwd)
print("BIC")
## [1] "BIC"
print(mod.fwd.summary$bic)
## [1] -2599.157 -3373.836 -3590.351 -3885.055 -4183.619 -4951.858 -5108.831
## [8] -5206.403 -5264.775 -5290.649 -5307.317 -5315.604 -5321.236 -5328.088
## [15] -5327.320 -5329.351 -5332.527 -5329.289 -5326.009 -5321.594
print("R²-adj")
## [1] "R²-adj"
mod.fwd.summary$adjr2
## [1] 0.5946360 0.6905533 0.7135429 0.7418904 0.7677427 0.8223045 0.8320876
## [8] 0.8380428 0.8416572 0.8434434 0.8447163 0.8455323 0.8462030 0.8469353
## [15] 0.8472626 0.8477365 0.8482690 0.8484641 0.8486568 0.8487900
plot(mod.fwd.summary$adjr2, main = "Forward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.fwd.summary$adjr2)
points(adjr2.max, mod.fwd.summary$adjr2[adjr2.max],
col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.fwd.summary$bic)
points(bic.min, mod.fwd.summary$adjr2[bic.min],
col = "blue", cex = 2, pch = 20)
mod.bkw <- regsubsets(PctEmployed16_Over ~ ., data = df, method = "backward", nvmax=20)
mod.bkw.summary <- summary(mod.bkw)
print("BIC")
## [1] "BIC"
print(mod.bkw.summary$bic)
## [1] -2599.157 -2979.831 -3820.672 -4201.440 -4687.407 -4951.858 -5108.831
## [8] -5198.908 -5294.366 -5288.172 -5313.537 -5327.879 -5335.826 -5344.923
## [15] -5343.785 -5346.442 -5350.331 -5354.096 -5351.775 -5348.049
print("R²-adj")
## [1] "R²-adj"
mod.bkw.summary$adjr2
## [1] 0.5946360 0.6454376 0.7354498 0.7686117 0.8048388 0.8223045 0.8320876
## [8] 0.8376230 0.8432675 0.8433093 0.8450495 0.8461858 0.8469761 0.8478228
## [15] 0.8481289 0.8486328 0.8491992 0.8497571 0.8499978 0.8501655
plot(mod.bkw.summary$adjr2, main = "Backward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.bkw.summary$adjr2)
points(adjr2.max, mod.bkw.summary$adjr2[adjr2.max],
col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.bkw.summary$bic)
points(bic.min, mod.bkw.summary$adjr2[bic.min],
col = "blue", cex = 2, pch = 20)
For this feature, we also evaluate the candidate models with cross-validation, using the caret package:
ten.folds <- trainControl(method = "cv", number = 10)
# leave.one.out <- trainControl(method = "LOOCV")
ten.folds.models <- list()
ten.folds.models[[21]] <- train(PctEmployed16_Over ~ ., data = df, method = "lm", trControl = ten.folds)
# leave.one.out.model <- train(PctEmployed16_Over ~ ., data = df, method = "lm", trControl = leave.one.out)
outcomes <- cancer.dataset %>% filter(!is.na(PctEmployed16_Over)) %>% pull(PctEmployed16_Over)
for(i in 1:20){
if(i > 1)
ten.folds.models[[i]] <- train(y = outcomes, x = as.matrix(df[, names(coef(mod.bkw, i))[-1]]), method = "lm", trControl = ten.folds)
else
ten.folds.models[[i]] <- train(y = outcomes, x = df[, names(coef(mod.bkw, 1))[-1], drop=FALSE], method = "lm", trControl = ten.folds)
}
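For reference, the same train() pattern in a self-contained form, on R's built-in mtcars data (toy example, not part of the analysis):

```r
library(caret)
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(mpg ~ wt + hp, data = mtcars, method = "lm", trControl = ctrl)
fit$results[, c("RMSE", "Rsquared")]   # cross-validated performance
BIC(fit$finalModel)                    # BIC of the final refit, as used below
```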
rsqrs <- 1:21
for(i in 1:21){
rsqrs[i] <- ten.folds.models[[i]]$results$Rsquared
}
plot(1:21, rsqrs, main = "Distribution of R² w.r.t the number of used variables")
points(which.max(rsqrs), max(rsqrs), col = "red", cex = 2, pch = 20)
rmses <- 1:21
for(i in 1:21){
rmses[i] <- ten.folds.models[[i]]$results$RMSE
}
plot(1:21, rmses, main = "Distribution of RMSE w.r.t the number of used variables")
points(which.min(rmses), min(rmses), col = "red", cex = 2, pch = 20)
bics <- 1:21
for(i in 1:21){
bics[i] <- BIC(ten.folds.models[[i]]$finalModel)
}
plot(1:21, bics, main = "Distribution of BIC w.r.t the number of used variables")
points(which.min(bics), min(bics), col = "red", cex = 2, pch = 20)
Note that the last point on the right of each plot was computed using all the variables. By BIC, the best model after cross-validation is again the one with 18 variables, and this is reflected in the R² curve as well.
coefs <- ten.folds.models[[18]]$finalModel$coefficients
print(coefs)
## (Intercept) avgAnnCount avgDeathsPerYear
## 79.1608747576 0.0006948743 -0.0019189410
## povertyPercent MedianAgeFemale AvgHouseholdSize
## -0.5100425805 -0.2725039006 0.6493057891
## PercentMarried PctNoHS18_24 PctHS18_24
## 0.8025860025 -0.0325926174 -0.0516786056
## PctHS25_Over PctBachDeg25_Over PctUnemployed16_Over
## 0.0636106665 0.2114370870 -0.4567802805
## PctEmpPrivCoverage PctPublicCoverage PctPublicCoverageAlone
## 0.1124319697 -0.5774574496 0.5300125134
## PctWhite PctAsian PctMarriedHouseholds
## -0.0208750381 -0.0993396784 -0.7759655151
## BirthRate
## 0.1094950529
preds <- predict.from.coefficients(df, coefs)
ggplot(data.frame(squared_error = (preds-df$PctEmployed16_Over)^2)) +
geom_density(aes(x = squared_error))
# sd for stochastic imputation
std.dev.bkw.imp <- sd(df$PctEmployed16_Over)
We can now impute the NAs:
imputation <- predict.from.coefficients(cancer.dataset, coefs) +
  rnorm(nrow(cancer.dataset), mean = 0, sd = std.dev.bkw.imp)
cancer.dataset.filled <- cancer.dataset.filled %>% mutate(PctEmployed16_Over = ifelse(is.na(PctEmployed16_Over), imputation, PctEmployed16_Over))
ggplot(cancer.dataset.filled) +
geom_point(aes(x=PctEmployed16_Over, y=TARGET_deathRate)) +
geom_point(data=cancer.dataset, aes(x=PctEmployed16_Over, y=TARGET_deathRate), color="red", alpha=0.5) +
labs(title = "Employment (>16) vs Death rate, red dots are the original values")
Finally, we repeat the first procedure on the last variable:
df <- cancer.dataset %>% select(
- Geography,
- binnedInc,
- PctEmployed16_Over,
- PctSomeCol18_24,
- TARGET_deathRate
) %>% filter(
!is.na(PctPrivateCoverageAlone)
)
nrow(df)
## [1] 2438
mod.full <- lm(PctPrivateCoverageAlone ~ ., data = df)
summary(mod.full)
##
## Call:
## lm(formula = PctPrivateCoverageAlone ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9186 -0.5239 0.0172 0.5529 6.5346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.316e+00 8.535e-01 1.542 0.123267
## avgAnnCount 7.877e-05 4.941e-05 1.594 0.111022
## avgDeathsPerYear -8.305e-04 2.581e-04 -3.218 0.001307 **
## incidenceRate -3.400e-04 4.638e-04 -0.733 0.463529
## medIncome -9.107e-06 5.192e-06 -1.754 0.079540 .
## popEst2015 1.063e-06 3.559e-07 2.987 0.002848 **
## povertyPercent 5.054e-02 9.972e-03 5.068 4.32e-07 ***
## studyPerCap -3.557e-05 4.148e-05 -0.858 0.391199
## MedianAge -5.615e-04 5.155e-04 -1.089 0.276214
## MedianAgeMale 1.817e-02 1.337e-02 1.359 0.174329
## MedianAgeFemale 2.692e-02 1.395e-02 1.930 0.053719 .
## AvgHouseholdSize -7.919e-02 6.129e-02 -1.292 0.196412
## PercentMarried 7.462e-03 9.698e-03 0.769 0.441738
## PctNoHS18_24 -5.658e-04 3.637e-03 -0.156 0.876382
## PctHS18_24 -5.174e-03 3.207e-03 -1.614 0.106762
## PctBachDeg18_24 -5.447e-03 7.086e-03 -0.769 0.442121
## PctHS25_Over -3.167e-03 6.336e-03 -0.500 0.617254
## PctBachDeg25_Over 4.307e-02 9.971e-03 4.320 1.63e-05 ***
## PctUnemployed16_Over -2.988e-02 1.047e-02 -2.853 0.004365 **
## PctPrivateCoverage 7.217e-01 8.500e-03 84.902 < 2e-16 ***
## PctEmpPrivCoverage 2.181e-01 6.744e-03 32.342 < 2e-16 ***
## PctPublicCoverage -7.103e-01 1.415e-02 -50.181 < 2e-16 ***
## PctPublicCoverageAlone 7.288e-01 1.786e-02 40.809 < 2e-16 ***
## PctWhite 1.543e-02 3.752e-03 4.111 4.07e-05 ***
## PctBlack 1.220e-02 3.603e-03 3.385 0.000722 ***
## PctAsian 1.843e-03 1.244e-02 0.148 0.882236
## PctOtherRace -2.181e-02 7.721e-03 -2.825 0.004767 **
## PctMarriedHouseholds -1.894e-03 9.242e-03 -0.205 0.837630
## BirthRate -2.340e-02 1.219e-02 -1.919 0.055128 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.125 on 2409 degrees of freedom
## Multiple R-squared: 0.9877, Adjusted R-squared: 0.9875
## F-statistic: 6900 on 28 and 2409 DF, p-value: < 2.2e-16
mod.fwd <- regsubsets(PctPrivateCoverageAlone ~ ., data = df, method = "forward", nvmax = 20)
mod.fwd.summary <- summary(mod.fwd)
print("BIC")
## [1] "BIC"
print(mod.fwd.summary$bic)
## [1] -4985.132 -7236.487 -8839.152 -10457.029 -10482.244 -10523.640
## [7] -10545.292 -10576.493 -10588.916 -10593.460 -10590.991 -10587.245
## [13] -10583.144 -10584.446 -10580.583 -10575.749 -10570.498 -10565.018
## [19] -10558.867 -10552.382
print("R²-adj")
## [1] "R²-adj"
mod.fwd.summary$adjr2
## [1] 0.8713617 0.9490536 0.9736724 0.9864793 0.9866557 0.9869169 0.9870687
## [8] 0.9872686 0.9873686 0.9874272 0.9874495 0.9874652 0.9874790 0.9875205
## [15] 0.9875355 0.9875455 0.9875534 0.9875601 0.9875633 0.9875649
plot(mod.fwd.summary$adjr2, main = "Forward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.fwd.summary$adjr2)
points(adjr2.max, mod.fwd.summary$adjr2[adjr2.max],
col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.fwd.summary$bic)
points(bic.min, mod.fwd.summary$adjr2[bic.min],
col = "blue", cex = 2, pch = 20)
mod.bkw <- regsubsets(PctPrivateCoverageAlone ~ ., data = df, method = "backward", nvmax = 20)
mod.bkw.summary <- summary(mod.bkw)
print("BIC")
## [1] "BIC"
print(mod.bkw.summary$bic)
## [1] -4985.132 -7031.479 -9535.493 -10457.029 -10482.244 -10523.640
## [7] -10545.292 -10576.493 -10588.916 -10593.460 -10588.143 -10588.650
## [13] -10588.009 -10584.897 -10580.583 -10575.749 -10570.498 -10565.018
## [19] -10558.867 -10552.382
print("R²-adj")
## [1] "R²-adj"
mod.bkw.summary$adjr2
## [1] 0.8713617 0.9445844 0.9802136 0.9864793 0.9866557 0.9869169 0.9870687
## [8] 0.9872686 0.9873686 0.9874272 0.9874348 0.9874724 0.9875040 0.9875228
## [15] 0.9875355 0.9875455 0.9875534 0.9875601 0.9875633 0.9875649
plot(mod.bkw.summary$adjr2, main = "Backward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.bkw.summary$adjr2)
points(adjr2.max, mod.bkw.summary$adjr2[adjr2.max],
col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.bkw.summary$bic)
points(bic.min, mod.bkw.summary$adjr2[bic.min],
col = "blue", cex = 2, pch = 20)
# use coef(mod.bkw, n) to extract the coefficients
The best model by BIC has 10 variables (we use the backward selection strategy, although the forward one gave equivalent results):
coefs <- coef(mod.bkw, 10)
print(coefs)
## (Intercept) povertyPercent MedianAgeFemale
## -0.32445181 0.06280057 0.04163113
## PctBachDeg25_Over PctUnemployed16_Over PctPrivateCoverage
## 0.04674681 -0.03462309 0.72354376
## PctEmpPrivCoverage PctPublicCoverage PctPublicCoverageAlone
## 0.21387879 -0.70265761 0.71934483
## PctWhite PctBlack
## 0.02059450 0.01491203
preds <- predict.from.coefficients(df, coefs)
ggplot(data.frame(squared_error = (preds-df$PctPrivateCoverageAlone)^2)) +
geom_density(aes(x = squared_error))
# sd for stochastic imputation
std.dev.bkw.imp <- sd(df$PctPrivateCoverageAlone)
imputation <- predict.from.coefficients(cancer.dataset, coefs) +
  rnorm(nrow(cancer.dataset), mean = 0, sd = std.dev.bkw.imp)
cancer.dataset.filled <- cancer.dataset.filled %>% mutate(PctPrivateCoverageAlone = ifelse(is.na(PctPrivateCoverageAlone), imputation, PctPrivateCoverageAlone))
ggplot(cancer.dataset.filled) +
geom_point(aes(x=PctPrivateCoverageAlone, y=TARGET_deathRate)) +
geom_point(data=cancer.dataset, aes(x=PctPrivateCoverageAlone, y=TARGET_deathRate), color="red", alpha = 0.5)
We now have a dataset without missing values:
aggr(cancer.dataset.filled)
cormat <- round(cor(cancer.dataset.filled %>% select(-Geography, -binnedInc)), 2)
ggcorrplot(cormat, hc.order = TRUE, outline.col = "white", tl.cex = 5.5)
The correlation matrix looks very similar to the one obtained earlier with the mice package.
We now want to build a linear regression model for the cancer death rate. In doing so, we are also interested in which features are the most relevant in this context; to this end, we will use the lasso.
cancer.dataset.filled <- cancer.dataset.filled %>% select(-Geography, -binnedInc)
We first plot the summary of a full linear model and some helpful scatterplots:
mod.full <- lm(TARGET_deathRate ~ ., data = cancer.dataset.filled)
summary(mod.full)
##
## Call:
## lm(formula = TARGET_deathRate ~ ., data = cancer.dataset.filled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.149 -10.747 -0.399 10.571 134.204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.624e+02 1.568e+01 10.359 < 2e-16 ***
## avgAnnCount -3.248e-03 7.709e-04 -4.214 2.59e-05 ***
## avgDeathsPerYear 1.750e-02 3.886e-03 4.504 6.91e-06 ***
## incidenceRate 1.915e-01 7.204e-03 26.587 < 2e-16 ***
## medIncome 1.117e-04 8.010e-05 1.395 0.163128
## popEst2015 -1.517e-05 5.453e-06 -2.783 0.005422 **
## povertyPercent 3.817e-01 1.605e-01 2.378 0.017467 *
## studyPerCap 1.705e-05 6.745e-04 0.025 0.979829
## MedianAge -2.160e-03 7.795e-03 -0.277 0.781746
## MedianAgeMale -4.929e-01 2.078e-01 -2.372 0.017774 *
## MedianAgeFemale -1.123e-01 2.162e-01 -0.519 0.603692
## AvgHouseholdSize 8.048e-01 9.557e-01 0.842 0.399823
## PercentMarried 1.305e+00 1.649e-01 7.912 3.53e-15 ***
## PctNoHS18_24 -1.488e-01 6.717e-02 -2.215 0.026863 *
## PctHS18_24 2.095e-01 6.240e-02 3.357 0.000797 ***
## PctSomeCol18_24 -6.971e-03 3.614e-02 -0.193 0.847055
## PctBachDeg18_24 -9.587e-02 1.143e-01 -0.839 0.401691
## PctHS25_Over 4.132e-01 9.628e-02 4.291 1.83e-05 ***
## PctBachDeg25_Over -1.157e+00 1.541e-01 -7.511 7.68e-14 ***
## PctEmployed16_Over -5.098e-01 9.479e-02 -5.378 8.10e-08 ***
## PctUnemployed16_Over 1.415e-01 1.653e-01 0.856 0.391903
## PctPrivateCoverage -4.476e-01 1.421e-01 -3.151 0.001643 **
## PctPrivateCoverageAlone -3.438e-02 8.010e-02 -0.429 0.667825
## PctEmpPrivCoverage 3.836e-01 1.049e-01 3.657 0.000260 ***
## PctPublicCoverage -2.826e-01 2.285e-01 -1.236 0.216424
## PctPublicCoverageAlone 3.699e-01 2.827e-01 1.308 0.190946
## PctWhite -1.275e-01 5.711e-02 -2.232 0.025673 *
## PctBlack -4.774e-02 5.517e-02 -0.865 0.386969
## PctAsian -4.135e-02 1.879e-01 -0.220 0.825853
## PctOtherRace -8.903e-01 1.236e-01 -7.201 7.54e-13 ***
## PctMarriedHouseholds -1.304e+00 1.581e-01 -8.246 2.41e-16 ***
## BirthRate -8.799e-01 1.926e-01 -4.569 5.10e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.26 on 3015 degrees of freedom
## Multiple R-squared: 0.5234, Adjusted R-squared: 0.5185
## F-statistic: 106.8 on 31 and 3015 DF, p-value: < 2.2e-16
df <- cancer.dataset.filled %>% pivot_longer(!starts_with("TARGET"), values_to = "vals", names_to = "vars")
for(c in colnames(cancer.dataset.filled)){
if(c != "TARGET_deathRate"){
p <- ggplot(df %>% filter(vars == !!c)) +
geom_point(aes(x=vals, y=TARGET_deathRate), alpha = 0.5) +
labs(x = c, title = paste(c,"vs death rate"))
print(p)
}
}
We notice that the variable MedianAge has some outliers with impossibly high values; those points most likely need to be divided by 10. Similarly, some values of AvgHouseholdSize (those below 1) need to be multiplied by 100.
cancer.dataset.filled <- cancer.dataset.filled %>% mutate(MedianAge = ifelse(MedianAge>100, MedianAge/10, MedianAge))
cancer.dataset.filled <- cancer.dataset.filled %>% mutate(AvgHouseholdSize = ifelse(AvgHouseholdSize<1, AvgHouseholdSize*100, AvgHouseholdSize))
df <- cancer.dataset.filled %>% pivot_longer(!starts_with("TARGET"), values_to = "vals", names_to = "vars")
ggplot(df %>% filter(vars == "MedianAge")) +
geom_point(aes(x=vals, y=TARGET_deathRate), alpha = 0.5) +
labs(x = "MedianAge", title = paste("MedianAge","vs death rate"))
ggplot(df %>% filter(vars == "AvgHouseholdSize")) +
geom_point(aes(x=vals, y=TARGET_deathRate), alpha = 0.5) +
labs(x = "AvgHouseholdSize", title = paste("AvgHouseholdSize","vs death rate"))
We compute the lasso model:
y <- cancer.dataset.filled$TARGET_deathRate
X <- cancer.dataset.filled %>% select(-TARGET_deathRate)
cv.out <- cv.glmnet(y = y, x = as.matrix(X), alpha = 1)
plot(cv.out)
bestlambda <- cv.out$lambda.min
bestlambda
## [1] 0.04210501
lasso.model <- glmnet(y = y, x = as.matrix(X), alpha = 1, lambda = bestlambda)
and these are the computed coefficients:
coef(lasso.model)
## 32 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.885567e+02
## avgAnnCount -2.788806e-03
## avgDeathsPerYear 1.241877e-02
## incidenceRate 1.908705e-01
## medIncome 9.329769e-05
## popEst2015 -8.987387e-06
## povertyPercent 2.731418e-01
## studyPerCap .
## MedianAge -5.917602e-03
## MedianAgeMale -4.163005e-01
## MedianAgeFemale -3.314116e-01
## AvgHouseholdSize -8.180784e+00
## PercentMarried 8.632295e-01
## PctNoHS18_24 -1.159079e-01
## PctHS18_24 2.408217e-01
## PctSomeCol18_24 .
## PctBachDeg18_24 -4.393079e-02
## PctHS25_Over 3.931600e-01
## PctBachDeg25_Over -1.167440e+00
## PctEmployed16_Over -3.951473e-01
## PctUnemployed16_Over 2.580653e-01
## PctPrivateCoverage -5.814308e-01
## PctPrivateCoverageAlone -6.203856e-03
## PctEmpPrivCoverage 3.721590e-01
## PctPublicCoverage -1.302624e-01
## PctPublicCoverageAlone 2.210668e-01
## PctWhite -1.394536e-01
## PctBlack -4.006803e-02
## PctAsian .
## PctOtherRace -8.534669e-01
## PctMarriedHouseholds -8.026494e-01
## BirthRate -8.129336e-01
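Only three predictors (studyPerCap, PctSomeCol18_24 and PctAsian, printed as "."), are zeroed out. The count of dropped variables can be read off the sparse coefficient matrix; a self-contained sketch on simulated data (the same pattern applies to lasso.model above):

```r
library(glmnet)
set.seed(1)

# Toy data: only the first two of six predictors truly matter
Xs <- matrix(rnorm(300 * 6), 300, 6)
ys <- Xs[, 1] - 2 * Xs[, 2] + rnorm(300)

fit <- glmnet(x = Xs, y = ys, alpha = 1, lambda = 0.5)
b <- as.matrix(coef(fit))[-1, 1]  # coefficients without the intercept
sum(b == 0)                       # number of predictors dropped by the lasso
```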
As the lasso penalty filters out only a few variables, we decide to pre-filter with backward stepwise selection:
mod.bkw <- regsubsets(TARGET_deathRate ~ ., data = cancer.dataset.filled, method = "backward", nvmax = 20)
mod.bkw.summary <- summary(mod.bkw)
print("BIC")
## [1] "BIC"
print(mod.bkw.summary$bic)
## [1] -802.9273 -1644.3033 -1829.8624 -1916.9769 -1961.9616 -1995.1886
## [7] -2029.2523 -2039.9059 -2043.0725 -2071.0381 -2086.0472 -2084.0338
## [13] -2093.5109 -2099.3634 -2102.2901 -2102.0906 -2100.4461 -2096.0825
## [19] -2091.1504 -2086.2792
print("R²-adj")
## [1] "R²-adj"
mod.bkw.summary$adjr2
## [1] 0.2354372 0.4212519 0.4566986 0.4732269 0.4821414 0.4889366 0.4957811
## [8] 0.4986971 0.5003700 0.5060737 0.5096315 0.5104362 0.5130793 0.5151315
## [15] 0.5167113 0.5177913 0.5186407 0.5190595 0.5193882 0.5197263
plot(mod.bkw.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R²",
     main = "Backward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.bkw.summary$adjr2)
points(adjr2.max, mod.bkw.summary$adjr2[adjr2.max],
col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.bkw.summary$bic)
points(bic.min, mod.bkw.summary$adjr2[bic.min],
col = "blue", cex = 2, pch = 20)
We then keep the 15 variables of the BIC-optimal model and apply the lasso technique again:
coef(mod.bkw, 15)
## (Intercept) avgAnnCount avgDeathsPerYear
## 185.347161100 -0.003515612 0.008580258
## incidenceRate MedianAgeMale PercentMarried
## 0.196763938 -0.768055787 1.208040728
## PctHS18_24 PctHS25_Over PctBachDeg25_Over
## 0.249120176 0.364366022 -1.087705370
## PctEmployed16_Over PctPrivateCoverage PctEmpPrivCoverage
## -0.554605910 -0.661982733 0.497177431
## PctWhite PctOtherRace PctMarriedHouseholds
## -0.103381638 -0.931121653 -1.285363044
## BirthRate
## -0.870231984
y <- cancer.dataset.filled$TARGET_deathRate
X <- cancer.dataset.filled %>% select(-TARGET_deathRate)
X <- X[, names(coef(mod.bkw, 15)[-1])]
cv.out <- cv.glmnet(y = y, x = as.matrix(X), alpha = 1)
plot(cv.out)
bestlamda <- cv.out$lambda.min
bestlamda
## [1] 0.01256265
lasso.model <- glmnet(y = y, x = as.matrix(X), alpha = 1, lambda = bestlamda)
coef(lasso.model)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 184.585273409
## avgAnnCount -0.003342230
## avgDeathsPerYear 0.008101041
## incidenceRate 0.196808007
## MedianAgeMale -0.749289236
## PercentMarried 1.150298205
## PctHS18_24 0.249152239
## PctHS25_Over 0.363376116
## PctBachDeg25_Over -1.090545980
## PctEmployed16_Over -0.538866760
## PctPrivateCoverage -0.655266381
## PctEmpPrivCoverage 0.480646968
## PctWhite -0.100804637
## PctOtherRace -0.928276429
## PctMarriedHouseholds -1.243061786
## BirthRate -0.861237859
This yields the following values for MSE, RMSE and MAE, computed on the whole (training) dataset:
y_est <- predict(lasso.model, as.matrix(X))
print("MSE:")
## [1] "MSE:"
print(mean((y - y_est)^2))
## [1] 370.2728
print("RMSE:")
## [1] "RMSE:"
print(sqrt(mean((y - y_est)^2)))
## [1] 19.24247
print("MAE:")
## [1] "MAE:"
print(mean(abs(y - y_est)))
## [1] 14.20949
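Since sum(e^2/n) equals mean(e^2), the three metrics can be collected in a small helper; a hypothetical convenience function, not part of the original analysis:

```r
# Returns MSE, RMSE and MAE for observed values y and predictions y_hat
reg_metrics <- function(y, y_hat) {
  err <- y - y_hat
  c(MSE = mean(err^2), RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
}

reg_metrics(c(1, 2, 3), c(1, 2, 5))  # MSE 4/3, RMSE sqrt(4/3), MAE 2/3
```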
Finally, we can get a better picture of the model’s performance using cross-validation:
k <- 10
y <- cancer.dataset.filled$TARGET_deathRate
X <- cancer.dataset.filled %>% select(-TARGET_deathRate)
fold_group <- runif(nrow(X), min = 1, max = k+1) %>% trunc()
X <- X %>% mutate(fold_group = fold_group)
error <- 0
error_abs <- 0
for(fold in 1:k){
# --- TRAIN DATASET
# - use backward method
df_train <- cancer.dataset.filled %>% filter(fold_group != fold)
y <- df_train$TARGET_deathRate
X <- df_train %>% select(-TARGET_deathRate)
mod.bkw <- regsubsets(TARGET_deathRate ~ ., data = df_train, method = "backward", nvmax = 20) # fit on the training folds only
mod.bkw.summary <- summary(mod.bkw)
bic.min <- which.min(mod.bkw.summary$bic)
# - use lasso
# filter unused variables
X <- X[, names(coef(mod.bkw, bic.min)[-1])]
# compute lambda
cv.out <- cv.glmnet(y = y, x = as.matrix(X), alpha = 1)
bestlamda <- cv.out$lambda.min
lasso.model <- glmnet(y = y, x = as.matrix(X), alpha = 1, lambda = bestlamda)
# --- TEST DATASET
df_test <- cancer.dataset.filled %>% filter(fold_group == fold)
y <- df_test$TARGET_deathRate
X <- df_test %>% select(-TARGET_deathRate)
X <- X[, (coef(lasso.model)[,1] %>% names())[-1]] # used variables
y_est <- predict(lasso.model, as.matrix(X))
# --- COMPUTE METRICS
error <- error + sum((y-y_est)^2)
print(paste("MSE of fold ", fold, " :"))
print(sum((y-y_est)^2)/nrow(X))
error_abs <- error_abs + sum(abs(y - y_est))
print(paste("MAE of fold ", fold, " :"))
print(sum(abs(y-y_est))/nrow(X))
}
## [1] "MSE of fold 1 :"
## [1] 329.1637
## [1] "MAE of fold 1 :"
## [1] 13.97176
## [1] "MSE of fold 2 :"
## [1] 402.5424
## [1] "MAE of fold 2 :"
## [1] 15.01391
## [1] "MSE of fold 3 :"
## [1] 302.2367
## [1] "MAE of fold 3 :"
## [1] 13.20454
## [1] "MSE of fold 4 :"
## [1] 435.5085
## [1] "MAE of fold 4 :"
## [1] 15.2009
## [1] "MSE of fold 5 :"
## [1] 323.8388
## [1] "MAE of fold 5 :"
## [1] 13.18518
## [1] "MSE of fold 6 :"
## [1] 380.3584
## [1] "MAE of fold 6 :"
## [1] 14.74062
## [1] "MSE of fold 7 :"
## [1] 340.5263
## [1] "MAE of fold 7 :"
## [1] 13.86783
## [1] "MSE of fold 8 :"
## [1] 454.5657
## [1] "MAE of fold 8 :"
## [1] 14.75693
## [1] "MSE of fold 9 :"
## [1] 412.1573
## [1] "MAE of fold 9 :"
## [1] 15.25759
## [1] "MSE of fold 10 :"
## [1] 365.3386
## [1] "MAE of fold 10 :"
## [1] 13.82159
print("========================================")
## [1] "========================================"
print("FINAL MSE")
## [1] "FINAL MSE"
print(error/nrow(cancer.dataset.filled))
## [1] 375.5833
print("FINAL RMSE")
## [1] "FINAL RMSE"
print(sqrt(error/nrow(cancer.dataset.filled)))
## [1] 19.37997
print("FINAL MAE")
## [1] "FINAL MAE"
print(error_abs/nrow(cancer.dataset.filled))
## [1] 14.31538
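A side note on the fold assignment: runif gives folds of random, slightly unequal sizes. A common alternative (a sketch, with a hypothetical seed) shuffles a repeated 1..k sequence, so fold sizes differ by at most one:

```r
set.seed(42)  # hypothetical seed, for reproducibility
k <- 10
n <- 3047     # number of counties in the dataset

# Each fold label appears floor(n/k) or ceiling(n/k) times, in random order
fold_group <- sample(rep(1:k, length.out = n))
range(table(fold_group))  # fold sizes: 304 or 305
```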
This shows that the previous values were (a little) optimistic, as expected. The error is not that large when compared with the spread of the response variable:
ggplot(cancer.dataset.filled) +
geom_histogram(aes(x=TARGET_deathRate), fill="white", color="black") +
labs(title = "Death rate distribution")
We have shown how linear regression, combined with the lasso and backward feature selection, can be used to predict county-level cancer death rates from economic and demographic data.